PI 6403 – Causal Inference
Department of Economics, Vienna University of Economics and Business
March 24, 2025
We define a cutoff value \(c\) and assume that treatment depends on the running variable as follows:
\[ W_i = 1\left(\{Z_i\geq c\}\right), \]
i.e., units \(i\) whose running variable meets or exceeds the cutoff are deemed treated, and all others are deemed not treated.
This treatment assignment is considered as good as random in the close vicinity of the cutoff point.
The following are examples of this kind of treatment assignment:
The running variable \(Z_i\) is a standardized test score, and all students above a cutoff \(c\) are admitted to an honors program.
The running variable \(Z_i\) is a score for severity of disease, and all patients above a cutoff \(c\) are prescribed an intervention.
The running variable \(Z_i\) is the election result of one of two parties. Districts where a certain party got 49 percent of the vote and districts where it got 51 percent of the vote should be similar in terms of covariate distributions. This can be used to investigate a possible incumbency advantage (Lee, 2008).
Why do previously considered approaches not apply in this setting?
Propensity-score methods required two assumptions, unconfoundedness and overlap:
\[ \begin{aligned} &\textcolor{var(--tertiary-color)}{\{Y_i(0),Y_i(1)\}\perp\!\!\!\!\perp W_i\mid Z_i,} \\ &\textcolor{var(--secondary-color)}{0<\mathbb{P}[W_i=1\mid Z_i]<1}. \end{aligned} \]
We thus cannot use methods that rely on division by \(\mathbb{P}[W_i=1\mid Z_i]\): in an RDD this probability is either 0 or 1, so overlap fails. Instead, we need to compare units with \(Z_i\) near the cutoff that are similar to each other, even though the distributions of \(Z_i\) for treated and untreated units do not overlap.
Let \(\mu_{(w)}(z) = \mathbb{E}[Y_i(w)\mid Z_i=z]\). Then, if both \(\textcolor{var(--primary-color)}{\mu_{(0)}(z)}\) and \(\textcolor{var(--quarternary-color)}{\mu_{(1)}(z)}\) are continuous, we can identify \(\tau_c=\textcolor{var(--quarternary-color)}{\mu_{(1)}(c)}-\textcolor{var(--primary-color)}{\mu_{(0)}(c)}\) via
\[ \tau_c=\textcolor{var(--quarternary-color)}{\underset{z\downarrow c}{\mathrm{lim}}\mathbb{E}[Y_i\mid Z_i=z]}-\textcolor{var(--primary-color)}{\underset{z\uparrow c}{\mathrm{lim}}\mathbb{E}[Y_i\mid Z_i=z]}, \]
or, in other words, as the difference between the endpoints of regression curves fitted to the right and to the left of the cutoff.
We can then estimate this using local linear regression.
We pick a small bandwidth \(h_n \rightarrow 0\) and a symmetric weighting function \(\textcolor{var(--tertiary-color)}{K(\cdot)}\) and fit \(\mu_{(w)}(z)\) via weighted linear regression on each side of the boundary:
\[ \hat{\tau}_c = \hat{\tau}, \qquad (\hat{a},\hat{\tau},\hat{\beta}_{(0)},\hat{\beta}_{(1)}) = \underset{a,\tau,\beta_{(0)},\beta_{(1)}}{\mathrm{argmin}}\left\{\sum^n_{i=1}\textcolor{var(--tertiary-color)}{K\left(\frac{|Z_i-c|}{h_n}\right)}\times\textcolor{var(--secondary-color)}{\left(Y_i-a-\tau W_i-\beta_{(0)}(Z_i-c)_--\beta_{(1)}(Z_i-c)_+\right)}^2\right\}, \]
where \(a\) and \(\beta_{(w)}\) are nuisance parameters.
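To make this concrete, here is a minimal Python sketch of the estimator. The function name `rdd_local_linear`, the triangular kernel, and the simulated example are our own illustration, not part of the original formulation:

```python
import numpy as np

def rdd_local_linear(y, z, c, h):
    """Local linear RDD estimate of tau_c with a triangular kernel.

    Fits Y ~ a + tau * W + beta0 * (Z - c)_- + beta1 * (Z - c)_+ by
    weighted least squares, using only observations with |Z - c| < h.
    """
    d = z - c
    k = np.maximum(0.0, 1.0 - np.abs(d) / h)  # triangular kernel weights
    keep = k > 0
    d, yk, k = d[keep], y[keep], k[keep]
    w = (d >= 0).astype(float)                # treatment indicator W_i
    # Design: intercept, W, negative part (Z-c)_-, positive part (Z-c)_+
    X = np.column_stack([np.ones_like(d), w, np.minimum(d, 0.0), np.maximum(d, 0.0)])
    sw = np.sqrt(k)                           # sqrt-weights for weighted least squares
    coef, *_ = np.linalg.lstsq(X * sw[:, None], yk * sw, rcond=None)
    return coef[1]                            # the tau component

# Toy example with a known jump of 0.5 at the cutoff c = 0:
rng = np.random.default_rng(0)
n = 2_000
z = rng.uniform(-1, 1, n)
y = np.sin(z) + 0.5 * (z >= 0) + rng.normal(0.0, 0.3, n)
print(rdd_local_linear(y, z, c=0.0, h=n ** (-1 / 5)))  # should be near 0.5
```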
Very generally, under the continuity assumptions mentioned before, the estimator must be consistent for reasonable choices of the bandwidth \(h_n\). To characterize its rate of convergence, however, we need a quantitative smoothness assumption on \(\mu_{(0)}(z)\) and \(\mu_{(1)}(z)\).
We assume that the \(\mu_{(w)}\) are twice differentiable with a uniformly bounded second derivative \(\left|\tfrac{d^2}{dz^2}\mu_{(w)}(z)\right|\leq B\) for all \(z\in\mathbb{R}\) and \(w\in \{0,1\}\).
Proposition 8.1
Consider an RDD where the running variable has a continuous distribution around the cutoff, and \(\mathrm{Var}[Y_i\mid Z_i=z]\leq\sigma^2\) for all \(z\). Suppose furthermore that \(\left|\tfrac{d^2}{dz^2}\mu_{(w)}(z)\right|\leq B\) holds for all \(z\in\mathbb{R}\), all \(w\in \{0,1\}\) and some \(B>0\). Then, the local linear regression estimator given by
\[ \hat{\tau}_c = \hat{\tau}, \qquad (\hat{a},\hat{\tau},\hat{\beta}_{(0)},\hat{\beta}_{(1)}) = \underset{a,\tau,\beta_{(0)},\beta_{(1)}}{\mathrm{argmin}}\left\{\sum^n_{i=1}K\left(\frac{|Z_i-c|}{h_n}\right)\times\left(Y_i-a-\tau W_i-\beta_{(0)}(Z_i-c)_--\beta_{(1)}(Z_i-c)_+\right)^2\right\}, \]
with bandwidth \(h_n=\kappa n^{-1/5}\) for some \(\kappa>0\), is consistent and has errors scaling as
\[ \hat{\tau}_c=\tau_c+\mathcal{O}_P(n^{-2/5}). \]
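As a quick sanity check of the \(n^{-2/5}\) rate, one can simulate the estimator at two sample sizes: increasing \(n\) by a factor of 16 should shrink the root-mean-squared error by roughly \(16^{2/5}\approx 3\). A sketch, reusing the hypothetical `rdd_local_linear` function from the previous code block:

```python
import numpy as np

def rmse(n, reps=200, kappa=1.0, seed=1):
    """Monte Carlo RMSE of the local linear estimator at sample size n."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(reps):
        z = rng.uniform(-1, 1, n)
        y = np.sin(z) + 0.5 * (z >= 0) + rng.normal(0.0, 0.3, n)
        errs.append(rdd_local_linear(y, z, c=0.0, h=kappa * n ** (-1 / 5)) - 0.5)
    return np.sqrt(np.mean(np.square(errs)))

# The n^{-2/5} rate predicts rmse(16_000) / rmse(1_000) ~ 16 ** (-2/5) ~ 0.33.
print(rmse(1_000), rmse(16_000))
```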
Optimized Estimation and Bias-Aware Inference
The local linear estimator depends on tuning choices, the kernel and the bandwidth, that are not pinned down by our smoothness assumption. We will thus set out to look for different linear estimators. Before, we noted that we could write the local linear estimator as
\[ \hat{\tau}_c = \sum^n_{i=1}\gamma_iY_i, \]
i.e., a linear function of the outcome vector \(Y\).
In a setting with homoskedastic and Gaussian errors, any linear estimator of that form, whose weights \(\gamma_i\) are only functions of the \(Z_i\), satisfies
\[ \begin{aligned} &\hat{\tau}_c (\gamma)\mid\{Z_1,\dots,Z_n\}\sim N(\hat{\tau}_c^*(\gamma),\sigma^2||\gamma||_2^2), \\ &\hat{\tau}_c^* (\gamma) = \sum^n_{i=1}\gamma_i\mu_{(W_i)}(Z_i), \end{aligned} \]
where \(W_i=1(\{Z_i\geq c\})\). Thus, any such estimator will be an accurate estimator for \(\tau_c\) if \(\hat{\tau}_c^*(\gamma)\approx\tau_c\) and \(||\gamma||_2^2\) is small.
The conditional variance of any such linear estimator can be observed directly:
\[ \mathrm{Var}(\hat{\tau}_c(\gamma)\mid\{Z_1,\dots,Z_n\})=\sigma^2||\gamma||_2^2 \]
The bias of linear estimators depends on the unknown functions \(\mu_{(w)}(z)\) and thus cannot be observed:
\[ \mathrm{Bias}(\hat{\tau}_c(\gamma)\mid\{Z_1,\dots,Z_n\}) = \sum^n_{i=1}\gamma_i\mu_{(W_i)}(Z_i)-(\mu_{(1)}(c)-\mu_{(0)}(c)). \]
But, if the curvature of \(\mu_{(w)}(z)\) is still assumed bounded by \(B\), then the bias can be bounded:
\[ \begin{aligned} &\left|\mathrm{Bias}(\hat{\tau}_c(\gamma)\mid\{Z_1,\dots,Z_n\})\right|\leq I_B(\gamma), \\ &I_B(\gamma) = \mathrm{sup}\left\{\sum^n_{i=1}\gamma_i\mu_{(W_i)}(Z_i)-(\mu_{(1)}(c)-\mu_{(0)}(c)):|\mu''_{(w)}(z)|\leq B\right\}. \end{aligned} \]
Recall that the mean squared error (MSE) of an estimator is the sum of its variance and squared bias. Because the variance does not depend on the conditional response functions, the worst-case MSE of any linear estimator is the sum of its variance and worst-case bias squared:
\[ \mathrm{MSE}(\hat{\tau}_c(\gamma)\mid\{Z_1,\dots,Z_n\})\leq\sigma^2||\gamma||_2^2+I_B^2(\gamma). \]
Thus, assuming that \(|\mu''_{(w)}(z)|\leq B\) and conditionally on \(\{Z_1,\dots,Z_n\}\), the minimax linear estimator uses the weights that minimize this worst-case bound:
\[ \hat{\tau}_c(\gamma^B)=\sum^n_{i=1}\gamma_i^BY_i,\qquad\gamma^B=\underset{\gamma}{\mathrm{argmin}}\{\sigma^2||\gamma||_2^2+I_B^2(\gamma)\}. \]
We can solve for the weights \(\gamma_i^B\) via quadratic programming.
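To illustrate, here is a sketch of the weight optimization using cvxpy. For tractability it replaces the exact worst-case bias \(I_B(\gamma)\) with a conservative Taylor-expansion bound: if the weights exactly reproduce constants and linear trends on each side of the cutoff, then any \(\mu_{(w)}\) with \(|\mu''_{(w)}|\leq B\) contributes bias at most \(Bt\), where \(t=\tfrac{1}{2}\sum_i|\gamma_i|(Z_i-c)^2\). This simplification, and all names in the code, are our own; the exact worst-case computation is more involved.

```python
import cvxpy as cp
import numpy as np

def minimax_weights(z, c, B, sigma2):
    """Sketch: bias-aware linear weights gamma for the RDD estimator.

    Minimizes sigma^2 * ||gamma||_2^2 + (B * t)^2, where
    t = 0.5 * sum_i |gamma_i| * (Z_i - c)^2 bounds the worst-case
    bias via Taylor expansion, subject to the weights reproducing
    constants and linear functions exactly on each side of c.
    """
    d = z - c
    wt = (d >= 0).astype(float)   # treated-side mask
    wc = 1.0 - wt                 # control-side mask
    g = cp.Variable(len(z))
    t = 0.5 * cp.sum(cp.multiply(d ** 2, cp.abs(g)))  # bias-bound variable t
    constraints = [
        wt @ g == 1,              # weights sum to +1 on the treated side...
        wc @ g == -1,             # ...and to -1 on the control side
        (wt * d) @ g == 0,        # no bias from linear trends on either side
        (wc * d) @ g == 0,
    ]
    objective = cp.Minimize(sigma2 * cp.sum_squares(g) + cp.square(B * t))
    cp.Problem(objective, constraints).solve()
    return g.value, t.value

# The estimate is then gamma @ y, and |bias| <= B * t (used below).
```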
From before, we know that the error \(\mathrm{err}=\hat{\tau}_c(\gamma^B)-\tau_c\) of our estimator is distributed as
\[ \mathrm{err}\mid \{Z_1, \dots, Z_n\} \sim \mathcal{N}\bigl(\mathrm{bias}, \sigma^2 \|\gamma^B\|_2^2\bigr). \]
In addition, the optimization procedure from before yields an upper bound for the bias as a by-product, in terms of the optimization variable \(t\): \(|\mathrm{bias}|\leq Bt\).
We can use the information from the previous slide to build confidence intervals as follows: Because the Gaussian distribution is unimodal and symmetric,
\[ \mathbb{P}[|\mathrm{err}| \ge \zeta] \leq\mathbb{P}[|B\,t + \sigma \|\gamma^B\|_2 \, S| \ge \zeta], \quad S \sim \mathcal{N}(0,1). \]
We can then obtain confidence intervals with coverage \(1-\alpha\) as follows:
\[ \begin{aligned} &\mathbb{P}[\tau_c\in\mathcal{I}_\alpha\mid\{Z_1,\dots,Z_n\}]\geq 1-\alpha,\\ &\mathcal{I}_\alpha=(\hat{\tau}_c(\gamma^B)-\zeta_\alpha^B,\hat{\tau}_c(\gamma^B)+\zeta_\alpha^B), \\ &\zeta_\alpha^B=\mathrm{inf}\{\zeta:\mathbb{P}[|Bt+\sigma||\gamma^B||_2S|>\zeta]\leq \alpha,\quad S\sim\mathcal{N}(0,1)\}. \end{aligned} \]
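Computing \(\zeta_\alpha^B\) is a one-dimensional root-finding problem in the Gaussian tail probability; a minimal sketch (function name and example numbers are our own):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def bias_aware_halfwidth(bias_bound, sd, alpha=0.05):
    """Smallest zeta with P(|bias_bound + sd * S| >= zeta) <= alpha, S ~ N(0, 1)."""
    def excess(zeta):
        tail = norm.sf((zeta - bias_bound) / sd) + norm.cdf((-zeta - bias_bound) / sd)
        return tail - alpha
    # The tail probability equals 1 at zeta = 0 and decreases to 0,
    # so the root is bracketed by [0, bias_bound + 10 * sd].
    return brentq(excess, 0.0, bias_bound + 10.0 * sd)

# Example: with |bias| <= B * t = 0.1 and sigma * ||gamma^B||_2 = 0.2, the
# half-width exceeds the naive 1.96 * 0.2 because it absorbs the bias.
print(bias_aware_halfwidth(0.1, 0.2))
```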
These confidence intervals are exact under the Gaussian homoskedastic model and, because they absorb the worst-case bias \(Bt\), remain valid for every pair of mean functions satisfying the curvature bound.
If we do not have Gaussian and constant-variance errors, we need to invoke a central limit theorem to argue that
\[ \hat{\tau}_c(\gamma)\mid\{Z_1,\dots,Z_n\} \approx \mathcal{N}\left(\hat{\tau}_c^*(\gamma),\sum^n_{i=1}\gamma_i^2\mathrm{Var}[Y_i\mid Z_i,W_i]\right). \]
However, if we assume that this approximation is valid, we can still get confidence intervals as before. We can also estimate the conditional variance in the previous equation via
\[ \hat{V}_n=\sum^n_{i=1}\gamma^2_i(Y_i-\hat{\mu}_{(W_i)}(Z_i))^2, \]
where \(\hat{\mu}_{(W_i)}(Z_i)\) can, e.g., be obtained via local linear regression.
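A direct implementation of this plug-in variance estimate (a sketch; `gamma` and `mu_hat` stand for the weights and the fitted values \(\hat{\mu}_{(W_i)}(Z_i)\) from the steps above):

```python
import numpy as np

def plugin_variance(gamma, y, mu_hat):
    """V_hat = sum_i gamma_i^2 * (Y_i - mu_hat_i)^2, the estimated
    conditional variance of the linear estimator sum_i gamma_i * Y_i."""
    return np.sum(gamma ** 2 * (y - mu_hat) ** 2)

# np.sqrt(plugin_variance(...)) then replaces sigma * ||gamma^B||_2
# in the confidence-interval construction above.
```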
However, the estimator from before is not necessarily minimax under heteroskedasticity. If we use it anyway, we can build confidence intervals using the procedure on this slide, but should be aware that the estimator is motivated by a simplified (homoskedastic) model.